{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Other Helper Methods\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table Joins\n", "\n", "ParquetDB provides a custom `join_tables` function that extends beyond the built-in [PyArrow `join`](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.join) method, allowing you to handle **custom extension types** and **more complex data types** in your joins. \n", "\n", "This notebook demonstrates how to perform various join operations—such as `left semi`, `right semi`, `left anti`, `right anti`, `inner`, `left outer`, `right outer`, `full outer` joins—on two PyArrow tables.\n", "\n", "\n", "While PyArrow’s built-in `join` is powerful, certain use cases may involve:\n", "\n", "- **Custom extension types** that PyArrow doesn’t support out-of-the-box.\n", "- **Complex or nested types** that require additional logic during joins (e.g., arrays of structs, custom objects, etc.).\n", "\n", "\n", "\n", "Below is the full implementation of the custom `join_tables` function. It closely mimics the logic of PyArrow’s built-in `Table.join` but adds:\n", "\n", "1. Index columns (`left_index` and `right_index`) to preserve the original row ordering.\n", "2. Logic to **coalesce keys** (if `coalesce_keys=True`).\n", "3. Automatic handling of **suffixes** for overlapping columns (`left_suffix` and `right_suffix`).\n", "4. The ability to seamlessly merge **custom extension types** and **complex data** that might otherwise be incompatible with the standard PyArrow join.\n", "\n", "```python\n", "def join_tables(\n", " left_table: pa.Table,\n", " right_table: pa.Table,\n", " left_keys,\n", " right_keys=None,\n", " join_type=\"left outer\",\n", " left_suffix=None,\n", " right_suffix=None,\n", " coalesce_keys=True,\n", "):\n", " \"\"\"\n", " Custom join operation for PyArrow Tables, accommodating complex or extension types\n", " and additional logic for suffixes and metadata merging.\n", "\n", " Parameters\n", " ----------\n", " left_table : pa.Table\n", " The left-side table to join.\n", " right_table : pa.Table\n", " The right-side table to join.\n", " left_keys : list or str\n", " Column name(s) in the left table for the join.\n", " right_keys : list or str, optional\n", " Column name(s) in the right table for the join.\n", " join_type : str, optional\n", " Type of join to perform. E.g., 'left outer', 'right outer', 'inner', 'full outer',\n", " 'left semi', 'right semi', 'left anti', 'right anti'. Defaults to 'left outer'.\n", " left_suffix : str, optional\n", " Suffix for overlapping column names from the left table.\n", " right_suffix : str, optional\n", " Suffix for overlapping column names from the right table.\n", " coalesce_keys : bool, optional\n", " Whether to coalesce join keys if columns have null values. Defaults to True.\n", "```\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " id_1 id_2 field_1\n", "0 100 10 left_1\n", "1 33 12 None\n", "2 12 13 None\n", " id_1 id_2 field_2\n", "0 100 10 right_1\n", "1 5 5 None\n", "2 33 13 None\n", "3 33 12 right_2\n", " field_2 id_1 id_2 field_1\n", "0 right_1 100 10 left_1\n", "1 right_2 33 12 None\n", "2 None 5 5 None\n", "3 None 33 13 None\n" ] } ], "source": [ "import pyarrow as pa\n", "from parquetdb.utils import pyarrow_utils\n", "\n", "# Construct two sample tables using ParquetDB-like logic\n", "left_data = [\n", " {\"id_1\": 100, \"id_2\": 10, \"field_1\": \"left_1\"},\n", " {\"id_1\": 33, \"id_2\": 12},\n", " {\"id_1\": 12, \"id_2\": 13, \"field_2\": \"left_2\"},\n", "]\n", "\n", "right_data = [\n", " {\"id_1\": 100, \"id_2\": 10, \"field_2\": \"right_1\"},\n", " {\"id_1\": 5, \"id_2\": 5},\n", " {\"id_1\": 33, \"id_2\": 13, \"extra_field\": \"right_extra\"},\n", " {\"id_1\": 33, \"id_2\": 12, \"field_2\": \"right_2\"},\n", "]\n", "\n", "# Convert to PyArrow tables\n", "left_table = pa.Table.from_pylist(left_data)\n", "right_table = pa.Table.from_pylist(right_data)\n", "\n", "df_left = left_table.to_pandas()\n", "df_right = right_table.to_pandas()\n", "\n", "print(df_left)\n", "print(df_right)\n", "\n", "# Perform a left outer join using built-in PyArrow\n", "pyarrow_join_result = right_table.join(\n", " left_table,\n", " keys=[\"id_1\", \"id_2\"],\n", " right_keys=[\"id_1\", \"id_2\"],\n", " join_type=\"left outer\",\n", " left_suffix=\"_right\",\n", " right_suffix=\"_left\", # reversed to illustrate differences\n", ")\n", "\n", "# Perform the same join with our custom join_tables\n", "custom_join_result = pyarrow_utils.join_tables(\n", " right_table,\n", " left_table,\n", " left_keys=[\"id_1\", \"id_2\"],\n", " right_keys=[\"id_1\", \"id_2\"],\n", " join_type=\"left outer\",\n", " left_suffix=\"_right\",\n", " right_suffix=\"_left\",\n", " coalesce_keys=True,\n", ")\n", "\n", "\n", "df_custom_join = custom_join_result.to_pandas()\n", "\n", "print(df_custom_join)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Drop Duplicates\n", "\n", "ParquetDB also provides a `drop_duplicates` function that allows you to drop duplicate rows from a PyArrow Table based on specified keys, keeping the first occurrence.\n", "\n", "\n", "```python\n", "def drop_duplicates(table, keys):\n", " \"\"\"\n", " Drops duplicate rows from a PyArrow Table based on the specified keys, \n", " keeping the first occurrence.\n", "\n", " Parameters\n", " ----------\n", " table : pyarrow.Table\n", " The input table from which duplicates will be removed.\n", " keys : list of str\n", " A list of column names that determine the uniqueness of rows.\n", "\n", " Returns\n", " -------\n", " pyarrow.Table\n", " A new table with duplicates removed, keeping the first occurrence \n", " of each unique key combination.\n", " \"\"\"\n", "```\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original Table:\n", " id name category\n", "0 0 Alice 1\n", "1 1 Bob 1\n", "2 2 Bob 1\n", "3 3 Charlie 2\n", "4 4 Alice 1\n", "\n", "Deduplicated Table (keeping first occurrence):\n", " id name category\n", "0 0 Alice 1\n", "1 1 Bob 1\n", "2 3 Charlie 2\n" ] } ], "source": [ "data = [\n", " {\"id\": 0, \"name\": \"Alice\", \"category\": 1},\n", " {\"id\": 1, \"name\": \"Bob\", \"category\": 1},\n", " {\n", " \"id\": 2,\n", " \"name\": \"Bob\",\n", " \"category\": 1,\n", " }, # Duplicate combination of (name, category)\n", " {\"id\": 3, \"name\": \"Charlie\", \"category\": 2},\n", " {\n", " \"id\": 4,\n", " \"name\": \"Alice\",\n", " \"category\": 1,\n", " }, # Another duplicate combination of (name, category)\n", "]\n", "\n", "# Convert to a PyArrow table\n", "table = pa.Table.from_pylist(data)\n", "\n", "# Specify the key columns that define uniqueness (excluding \"id\"—the function will add it automatically)\n", "unique_keys = [\"name\", \"category\"]\n", "\n", "# Drop duplicates\n", "deduplicated_table = pyarrow_utils.drop_duplicates(table, unique_keys)\n", "\n", "# Show results\n", "print(\"Original Table:\")\n", "print(table.to_pandas())\n", "\n", "print(\"\\nDeduplicated Table (keeping first occurrence):\")\n", "print(deduplicated_table.to_pandas())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "parquetdb_dev", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.20" } }, "nbformat": 4, "nbformat_minor": 2 }